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[57] ABSTRACT 

A method of performing a parallel join operation on a pair 
of relations Rl and R2 in a system containing P processors 
organized into Q clusters of P/Q processon each. The 
system contains disk storage f<x^ each cluster, shared by the 
processors of that cluster, together with a shared intermedi- 
ate memory (SIM) accessible by all processon. The rela- 
tions Rl and R2 to be joined are first sorted on the join 
column. The underlying domain of the join column is then 
partitioned into P ranges of equal size. Each range is further 
divided into M subranges of progressively decreasing size to 
create MP task^ T^, the subranges of a given range being 
so sized relative to one another that the estimated comple- 
tion time for taskT,^ is a predetemoiDcd fraction diat of task 
Tasks T^ with larger time estimates are assigned 
(and the corresponding tuples shipped) to the cluster to 
which processor p belongs, while tasks with smaller time 
estimates arc assigned to the SIM, which is regarded as a 
universal cluster (duster 0). The acmal task-to-processor 
assignments arc determined dynamically during the join 
phase in accordance with the dynamic longest processing 
time first (DLFT) algorithm. Each processor within a cluster 
picks Its next task at any given decision point to be the one 
with the largest time estimate which is owned by that cluster 
or by cluster 0. 
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METHOD OF PERFORMING A PARALLEL 
RELATIONAL DATABASE QUERY IN A 
MULTIPROCESSOR ENVIRONMENT 

CROSS-REFERENCE TO RELATED j 
AFPUCAnONS 

This application is related to the foUowing commonly 
owned coocurreotly filed applications, the specifications of 
which are incorporated herein by reference: 

T. Borden. L S. Narang, D. B. Rathi and D. J. Wisneski, 10 
"System and Method for Parallel Ptocessing of Com- 
plex Read-Only DaUbase Queries", Scr. No. 08/148, 
091. now U.S. Pat No. 5,495,606; 

J. L. Wolf. P. S. Yu. and J. J. Turek, 'Task Sdieduler for 
a Multiprocessor System", Serial No. 08/148,108, now 
abandoned in favor of continuation application Ser. Na 
08/293,257, filed Aug. 19, 1994, now U.S. Pat. No. 
5.437,032. 

BACKGROUND OF THE INVENTION 20 
Field of the Invention 

This invention relates generally to a method of perform- 
ing a parallel query in a multiprocessor environment and, 
more particularly, to a method for performing such a query 
with load balancing in an environment with shared disk ^ 
dusters, shared intermediate memory or both. 

Description of the Related Art 

A common operation in relational database systems is &e 
join of two relations on respective columns defined over a ^ 
common domain. See. for example, the descr^on of joins 
in C. J. Date, An Intrvduction to Database Systems^ vol. 1 
(4th ed. 1986), at pp, 132-136, 266-268, 341-343 and 
348-349. The result of the join is a new relation in which 
each row is the concatenation of two rows, one from each of 
the origioal relations, such &at both rows have &e same 
value in their respective join coluains. 

One popular algorithm for con^>uting the join of two 
relations is the sort meige technique as described by M. W. 
Blasgen et al. in ''Storage and Access in Relational 40 
Databases". IBM Systems Journal, vol. 16. no. 4, pp. 
363-377 (1977). It can be summarized briefly as follows: 
First, each of the relations is soned (if necessary) according 
to the join column. Second, the two sorted relations arc 
scanned in the obvious Interlocked sequence and meiged for 43 
rows which have equal values. In a multiprocessor system, 
diis algorithm may be inqilemented by partitioning the join 
into independent tasks which are performed in parallel by 
the proccsscH^. 

Another pqMilar algorithm for computing the join of two so 
relations is the hash join technique described by D. J. De^tt 
ct aL in ''Multiprocessor Hash-Based Jdn Algorithms**, 
Proceedings of the Nth International Conference on Very 
Large Databases, pp. 151-164 (1985). For a multqirocessor 
system, it can be summarized briefly as follows: First, both 33 
relations are hashed (if necessaiy) into hash partitions 
according to the join columns. The number of hash partitions 
generally is s^ equal to the number of processors. Then the 
hash partitions are distributed among the processors, so that 
the corresponding partitions of the two relations reside on 60 
the same processor. The corresponding hash partitions of the 
two relations are then joined together. 

Although the execution of a join query using either of 
these methods may be sped up by the use of multiple 
processors, the speedup can be very limited in the presence 65 
of data skew, as may t>e appreciated from the following 
cxainple of a parallel system architecture. 



146 

2 

Referring to FIG. 1, consider a database architecture 100 
consisting of P processors 104 in which there is sharing of 
either or both of the following types: 

1. The ]HX)cessors 104 are divided into Q^P equal-size 
clusters 102« with F/Q processors per cluster. Within each 
dusta 102, ail disks 106 are shared by ail of the proces- 
sors 104. (The case Q=l corresponds to a data-sharing 
architecture, and the case B=Q would correspond to a 
shared-nothing architecture. When 1<Q<P the ardiitec- 
ture can be characterized as hybrid) 

2. There is a single shared intermediate memory (SIM) 108 
accessible by all of the processors. 

FIG. 1 depicts an architecture in which both of the above 
types of sharing are present Suppose It is desired to perform 
a parallel join or other parallel query in this environment 
Without knowing the underlying distribution of tuples in the 
two relations being joined. It is impossible to partition the 
underlying domain into tasks in sudi a way &at the load in 
the final (join) phase is guaranteed to be balanced. Even if 
the actual amount of work for each taskh^)pens to be equal, 
there may be stochastic variations in individual processor 
speeds due to external events, and this nonhomogeneity will 
also lead to load imbalances. 

D. M. Dias et al., in U.S. Pat No. 5,121,494 entiUed 
"Joioing Two Database Relations on a Common Field in a 
Parallel Relational Database Field" (Jun. 9, 1992), disclose 
a method of parallelizing a join operation in a shared- 
nothing architecture in which additional subtasks are created 
by partitioning existing tasks and assigned to less busy 
processors if a load imbalance is sensed. Similar methods 
are described in the f cowing references: 
J. L. Wolf et al., "An Effective Algorithm for Parallelizing 
Sort Merge Joins in the lYesence of Data Skew"^, 
Proceedings of the Second International Symposium on 
Databases in Parallel and Distributed Systems, pp. 
103-115 (1990); 
D. M. Dias et aL, ''Methods for Improving the Efficiency 
of Parallel Son Merge Joins in the Presence of Data 
Skew**, IBM Technical Disclosure Bulletin, vol. 33. no, 
lOA, pp. 166-170 (March 1991); 
J. Wolf et at, **An Effective Algorithm for Parallelizing 
Hash Joins in the Presence of Data Skew**. Proceedings 
of the Seventh International Conference on Data 
Engineering, pp. 200-209 (1991); 
J. L. Wolf et aL, "AParallel Sort Merge Join Algorithm for 
Managing Data Skew**, IEEE Transactions on Parallel 
and Distributed Systems, vol. 4, no. 1, pp. 70-86 
(January 1993). 
While the techniques described in these references do 
ameliorate the problem of data skew, it would be desirable 
to have a parallclization method which does not require the 
creation of new tasks. 

SUMMARY OF THE INVENTnON 

The Invention described here relates to a parallel query 
method which has a high likelihood of balancing the load 
well in the face of either initial load imbalance due to data 
skew or later load imbalance due to stochastic process 
variations. The invention is described in the context of a sosi 
merge join. However, the same basic metfiod can also be 
applied to hash joins, softs, or other queries in a natural 
manner. 

In a parallel sort merge join, the relations to be joined arc 
first sorted. In parallel, within their clusters 102 (HG. 1). In 
a naive parallel sort merge join, the undertying join column 
domain might be partitioned Into P ranges of equal size, and 
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the tuples transferred accordingly among the clusters 102. 
However, given a nonuniform distribution of tuples aaoss 
the underlying domain* there is no guarantee that the amount 
of join phase work will be equaL 

Iq accordance with the present invention, each of the P 
ranges is furtiier divided into a relatively small number M of 
components, creating MP tasks in all. These con^x)- 
nents intentionally have nonequal task time estimates. For 
example, a reasonable approach would t>e to partition the 
tasks so that the estimated completion time of a taskT„^ is 
half that of tiie previous Usk T^i^. Assuming that the 
quadratic output term dominates the task time estimates^ this 
can be done by partitioning the tasks in such a manner that 
the extent of the range of a given task (to which the 
number of tuples in the usk is roughly projportional) is 17^2 
times the number of tuples in Usk T^x^. FIGS. IDA and 
lOB show an example of such a partitioning. FIG. lOA 
shows estimated Usk times as a function of m and p. and 
FIG. lOB shows actual task times, also as a function m 
and p. The latter may be different from the former, and will 
not be known until the join phase, when the tasks are 
acmally performed 

Cluster ownership of these tasks is assigned as follows: 
The tasks with larger time estimates (that is, those tasks 
with small values of m) are assigned to the duster 102 to 
which processor p belongs. Thus, tasks T^ with small m 
satisfying p=l will t»e assigned to cluster 1, while tasks with 
small m satisfying p=P will be assigned to cluster Q. The 
tasks with smaller time estimates (large values of m) arc 
assigned to SIM lOS, which constitutes a universal cluster 
which is labeled cluster 0. The rule for determining which 
tasks are small and which are large d^>ends on whether or 
not they fit into the SIM 108, Specifically, 1/P of the SIM 
108 is allotted to eadi prooesscH* 108, and during the transfo 
phase the tuples are shipped to the SIM in order of increas- 
ing task time estimates until the allotted portion of &e SIM 
is filled. Remaining mples are shipped to a disk 106 within 
the cluster 102 that owns them. All tasks which fit com- 
pletely within the SIM 108 are regarded as small, and the 
remainder are regarded as large. 

In FIGS. lOA and lOB, the "dotted** Usks arc owned by 
duster 0, while the "shaded" tasks are owned by their 
respective clusters. Note again that the high values of m 
correspond to cluster 0» while the low values ccHTCSpond to 
the other clusters. 

The actual Usk-to^)!rocessor assignments are determined 
dynamically during the join phase, according to the follow- 
ing rule, which is a generalization of the standard dynamic 
longest processing first (DUT) algorithm as described, for 
example, in T. Hu. Combinatorial Aigorithms (19S2). Each 
processor 104 within a cluster 102 pidES its next task at any 
given decision point to be the one with the largest time 
estimate which is owned by that duster or by the universal 
duster 108. (A dedsion point occurs at initiation time or 
whenever a current Usk conq;>letes.) Thus it can be assumed 
that each task Tj^ is performed by the corresponding 
processor p. After those tasks con^lete, the next Usks are 
picked dynamically. Although the task times may not be 
estimated with perfect precision and the speeds of tiie 
processors 104 may not be entirdy homogeneous, the inven- 
tion coupled wiA the flexibility inherent in the sharing 
provides a mechanism for limiting the join phase load 
imbalance. 

Any daUbase system supporting complex queries and 
employing dustered disk sharing and/or a shared interme- 
diate memory can make use of this invention. There are no 
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Special i^^)lemcntation requirements other than the ability to 
create tasks and assign them dynamically. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a schematic block diagram of a multiprocessor 
system incorporating die present invention. 

FIGS. 2 and 3 illustrate a pair of relations which may be 
joined using the present invention. 
10 FIG. 4. illustrates the partitioning of a relation into a 
plurality of ranges comprising subranges of progressively 
decreasing extent 

FIG. 5 is a flowchart of the overall sequence of operations 
of the system shown in FIG. 1. 

FIG. 6 is a flowchait of the sorting phase of Ihc sequence 
shown in FIG. 5. 

FIG. 7 is a flowchart of the partition phase of the sequence 
shown in FIG. 5. 
^ FIG, 8 is a flowchart of the transfer phase of the sequence 
shown in FIG. 5. 

FIG. 9 Is a flowdiart of the join phase of the sequence 
shown in FIG, 5. 
FIG. lOA is a three-dimensional (3D) graph of the esti- 
2^ mated task completion time as it varies among tasks. 

FIG. lOB is a 3D graph of die actual task completion time 
as it varies among tasks. 
FIG. 11 is a Venn diagram of the K most frequent titles 
^ in each of a pair of relations Rl and R2. 

FIG. 12 is a flowdiart of die o|)eration of a particular 
partitioning mechanism for partitioning an original query 
into sets of independent tasks of decreasing estimated execu- 
tion time. 

DESCRIPnON OF THE PREFERRED 
EMBODIMENT 

General System Architecture 

40 Referring to FIG. 1, a multqaxxxssor system 100 incor- 
porating the present invention includes P processors 104 
organized into Q equal-size clusters 102. each cluster con- 
taining P/Q iwocessors. Each jwoccssor 104 may be either a 
uniprocessor or a con^)lex of ti^Uy coupled processors (not 
45 separately shown) that, for the purposes of task assignment 
are regarded as a single processor. Each duster 102 also 
indudcs one or more direct access storage devices (DASD) 
106, which are magnetic disk drives in the system 100 
shown. Each processor 104 within a cluster 102 can access 
50 any stc^age device 106 In the same cluster, but cannot access 
any storage device in any other duster 102, ftocessors 104 
are interconnected to one another as well as to a single 
intermediate menwry (SIM) 108, to which each processor 
has access. SIM 108 is also refeired to herein as the 
55 universal cluster, or duster 0. In addition to the memory 108 
and storage devices 106 shown, each processes 104 also has 
its own T"«'" memory (not sq;>arately shown). In the case of 
a processor 104 comjnising a tightly coupled processor 
con:q>lcx. such main memory would be shared by the pro- 
60 cessOTs of the complex. The elements shown in FIG. 1 are 
conventional in the art, as are the interconnections between 
these dements. 

Processors 104 arc used for the concurrent parailcl execu- 
tion of tii.<tif.«) making up daUbase queries, as described 
65 below. A query may originate dther from one of the pro- 
cessors 104 or from a separate front-end query processor as 
described in the concurrently filed application of T. Borden 
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et al„ now U.S. Pa. No. 5,495,606. As further described in 
that application, within each cluster 102 the query splittiog 
and scheduling steps described below may be performed by 
an additional processor or processors (not shown) similar to 
processors 104\ such additional processors would not be 
counted among the P/Q processors 104 per con^lex 102 to 
which tasks are assigned. 

Query Elements 

Referring to FIGS, 2 and 3, a typical query may comprise 
a join operation performed on two database tables (or 
relations, in relational database terminology) such as tables 
200 (EMPLOYEE) and 300 (DEFT). Table 200 
(EMPLOYEE) lists for each en^loyce of an oiganization 
the employee's name (NAME), ttie employee's department 
(DEPT) and the employee's salary (SALARY). Table 300 
lists for each department of the organization the department 
name (DEPT) and the name of the manager of that dq>art- 
ment (MGR). Each table con^irises a plurality of rows* also 
referred to as reccrds or tuples. Each row in turn comprises 
one ca- more columns, also referred to as fields or attributes. 
Whereas die number of columns in a relation remains fixed, 
the number of rows (known as the cardinality of the relation) 
may vary as particular tuples are added or deleted. In the 
system 100, the tuples making up the database tables such as 
tables 200 and 300 are distributed among the storage devices 
106 of the various clusters 102. so that different parts of each 
table geno-ally reside on different clusters. 

Each column of relations 200 and 300 has an underlying 
domain^ which consists of die set of possible values far diat 
column, irrespective of whedier any tuples the relation 
actually have that value in ttt column. For die relatioDs 200 
and 300 shown, the domains of each column would extend 
from A(one letter) to 777. , . . ZZZ (n letters, where n is the 
field size). Similariy, a numerical column (not shown in 
FIGS. 2 and 3) might have a domain extending between two 
numerical values. In either case, since the value are ordered 
(alphabedcaily or numerically), it is possible to partition the 
domain (as described below) into ranges of contiguous 
values, such that each value (and hence die tuple containing 
that value) may be assigned to one and only one range. 

In this particular example, it will be assumed that the 
query requests a listing by depaitments of department man- 
ager name, dq)artment and department manager's salary, 43 
ordered by department maiuiger name. Expressed in Struc- 
tured Query Language (SQL), a language widely used for 
database queries, this query takes the form: 
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line 2 specifics that these columns arc taken from the set 
of all tuples formed by concatenating a tuple frxsm table 200 
(EMPLOYEE) with a tuple from tabic 300 (DEFT). This set 
of tuples is known as the (Cartesian product of tables 200 and 
300. 

Line 3 of the query, which is refened to as the predicate 
(the join predicate in this instance, since die query is a join), 
specifies that only those tuples of the Cartesian product for 
which the NAME column of the EMPLOYEE con[q)oacDt is 
equal to the MGR column of the DEFT component are 
selected. Since in this case the join predicate specifies an 
equality, this type of query is known as an equijoin. 

Finally, line 4 specifies that the output listing is to be 
sorted by the NAME ccdumn of the output lisdng. 

In ficncial. Query 1 may be partitioned into concurre ntly 
e xecuted tasks by having each processor 104 process only 
t hat part of the query correspondin g to particular valu es of 
one" cA the Columns Of a tableTfor example, table 200 
f g^PCOYKEt. More specUicaUv. Ouerv 1 mav be part i- 
ti oned into tasks correspondin g to n py ticular rangp f^ f p ne 
rPlinr nS THuc if the underlyinfJi dnmain 

ofU h e NAME xQluiBn ^tablc 200 (EMPLOYEE) exte nds 
from A from 777 777. ^ginal query may he xplit 
into 26 independent tasks for concurrent execution, so th at 
th e first task, for example, might be as fol lows: 

TABLE 2 

1 SELECT NAME« EMFLOYEEJ)EPT, SALARY 

2 FROM EMF1X>VEE, DEBT 

3 WHERE NAME =MGR 
3a ANDA£NAME<B 

4 ORDER BY NAME 



35 



40 



Lines 1-3 aAd4 of task 1 are similar to lines 1*4 of query 
1. line 3c adds another condition to the predicate, namely, 
that only those niples of the EMPLOYEE table for which 
NAME begins with A are considered. 

The two conditions on lines 3-^ taken together imply 
that MGR is subject to the same range condition as NAME. 
Therefore, instead of having to consider the entire DEFT 
table, only those tuples for which A^MGR<B need be 
considered in this paiticular task. Task 1 might be restated. 
&erefore« as follows: 



TABLE 1 



Quay 1 



1 SELECT NAME, EMPLOYEE DEFT, SALARY 

2 FROM EMPLOYEE, DEFT 

3 WHERE NAME = MOR 

4 ORDER BY NAME 



1 SELECT NAME. EMPIOYHEDEPT. SALARY 

2 FROM EMPLOYEE, DEPT 
^ 3 WHERE NAME = MOR 

3ft ANDASNAME<B 

3b AND A S MOR < 6 

4 ORDER BY NAME 



55 



Line 1 of this query specifies the colunuu requested as 
output: in this case, the NAME, DEPT and SALARY 
columns of table 200 (EMPLOYEE). The notation 60 
EMPLOYERDEPT means diat the DEPT column is to be 
taken from the EMPLOYEE table; otherwise, since DEPT 
appeals in both tables, the specification would be ambigu- 
ous. The particular table from which the NAME and SAI^ 
ARY columns are taken need not be specified (althou^ it 65 
may be specified), since these colunm names appear only in 
the EMPLOYEE table and hence there is no ainbiguity. 



Task 1 is equivalent to the original query as performed on 
subsets (or partitions) of the EMPLOYEE and DEPT tables 
200 and 300 in which NAME and MGR begin with the letter 
A. 

Other tasks into which Qacry 1 is divided would be 
similar in form to task 1. except that die additional condition 
would reference a different part of die EMPLOYEE and 
DEPT tables. Thus, task 2 might be: 



1 SELECT NAME, EMFX^OYEEDEFT, SALARY 

2 FROM EMRLOYEB, UEFi 
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-continued 

3 WHERB NAME = MGR 
aa AND B £ NAME < C 
3b AND B S MGR < C 

4 ORDER BY NAME 
while task 26 might be: 

1 SELECT NAME, EMPLOYEEDEPT, SALARY 

2 FROM EMSLOYBB, DEFT 

3 WHERE NAME ~ MGR 
3a AND NAME SZ 

3b AND MGR £Z 

4 ORDER BY NAME 



It may be readily verified that these tasks arc mutually 
exclusive and that their results, when merged, replicate the 
results that would be obtained by performing the original 
unpartitioned query. 

As can be inferred from the above cxaiaplt, die tasks into 
which tables 200 and 300 are paititioDcd, using a partition- 
ing scheme (rf one task per processor, can vary widely in the 
number of tuples. For cxan^)le, relatively few if any names 
start with X, whereas a great many may start with other 
letters such as E. Similar skew patterns may exist fot oUicr 
types of data (eg*, zip codes). Since die query completion 
time is determined by the time required to complete the 
longest task, such data skew can greatly prolong the query 
time, nullifying the benelits of parallelization. 

Partitioning Procedure 

Referring to FIG. 4, in accordance wiih the present 
invention, each table or relation 400 fonning part of the 
query is first partitioned into P ranges p (402) of equal size, 
where P is the total number of processors 104 in the system 
100 and l^p^P. This partitioning is preferably pafonned 
on die basis of the underlying domain of a particular column 
of the relation (preferably the join column), not the acmal 
number of tuples in a particular range. Thus, if NAME and 
MGR are the respective join columns for joining the 
EMPIjOYEE table 200 (FIG. Z) and the DEPT tabic 300 
(FIG. 3). as in the above example, die EMPLOYEE table 
may be divided into a first range consisting of those tuples 
for which NAME begins with the letter "A", a second range 
consisting erf diose tuples fo£ which NAME begins witfi the 
letter 'B", and so on, as already described. Correspondingly, 
the DEFT table 300 would be divided into a first range 
consisting of those tiq)lcs for which MGR begins witti the 
letter '^A**. a second range consisting of those tuples for 
which MGR begins with tfie letter **B", and so on, as already 
described. 

Next, each range p (402) into which a relation 400 
fonning part of the query has been partitioned is further 
divided into M subranges 404 to form tasks T^ of pro- 
gressively decreasing estimated task time, where l^m^M. 
This division into subranges 404. like the original division 
into ranges 402, is preferably performed on the basis of the 
imderlying column domain, not the actual number of titles 
in a particular range. As a particular example, the ranges 402 
might be divided in such a manner diat cadi subrange m 
(404) has an extent in the underlying column domain that is 
1/V2 that of the preceding subrange m-1. Since the comple- 
tion time for a join of two relations is roughly proportional 
to the product of the number of tiq>les in each relation, such 
a partitioning results in an estimated conqdetion time for a 
taskT^ that is approximately half that of the preceding task 

In the event diat additional tuple cardinality infc^roation 
is available, the creation of these tasks can be slightly 
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modified. For exan^ie. some database catalogs keep track of 
the cardinalities of the K most frcqucntiy occurring tuples 
for some small value of K. Thus, referring to FIG. U, such 
a catalog may keep track of die K most frcquendy occuning 

5 tuples 1102 in a first relation Rl, as well as die K most 
frequentiy occurring tuples 1104 in a second relation R2. If 
ttiis information is available, there will be K' frequently 
occuning values 1108 (where O^K'^K) for whidi die tuple 
cardinalities of bodi relations arc known, and 2(K-K') 

10 values 1106, 1110 for which die tuple cardinality of one 
relation is known and ttie other is bounded from above. 
These latter values can eidier be estimated or determined 
explicitly during the sort phase, and those values with large 
enough task times turned into separate tasks. These tasks can 

15 even be split into mult^jle subtasks to be pcrfOTmed on 
several processors if necessary, as is described in J. L. Wolf 
ct aL (1990) and J. L. Wolf ct al. (1991), cited above. 

Processors 104 arc initially assigned those Usks having 
the largest estimated completion times. Upon the completion 

20 of a task by a processor, it is assigned an awaiting task 
having the largest estimated completion time, so that die 
tasks having the smallest estimated conviction times are die 
last to be assigned. This allows die smaller tasks to be 
assigned in such a manner as to smooth out load imbalances 

2^ that may develop among die processors 104. At the same, 
having die initially perfcrroed tasks relatively large mini- 
mizes scheduling overhead. 

Overall Query Procedure 

^ ^efe ging to FIG . 5i dio overall query proc e dure i s con- 
ve nientlv divided into four successive phas^ iA-^^^^phase 
"^iO ^ a partition phase 520. a transf er phasft «"d-fl join 
ph ase 540. Each of tt iese phases is performed in paraU el 
w idiin die dustPTs 102 (FIG. with each cluster ixaf flim- 
ing itfi pmti^ fff a rffrti^^i'^T phaof^ it relates to the tuples 
wittiin die cluster. 

Rcfciring tolFIG. 6, in die sort phase 510, the mplcs of 
each relation forming part of the query that are locally stcHed 

^ on a particular cluster 102 arc sorted (if necessary) in 
accordance with the value in the column forming the basis 
for partitioning (step 512); In most instances, this column is 
preferably the join column. This sorting facilitates the par- 
titioning of the relations in the next phase as weU as their 

43 subsequent transfer, since the titles of each partition are 
consecutively accessible. 

Refeiring to FIGS. 4 and 7. in die partition phase 520. die 
join request is partitioned into P sc^s of M tasks T^ of 
progressively decreasing estimated task time, for a total of 

50 MP tasks overall In die case of an cquijoin quer^ (as in die 
above example), this may be accon^lishcd by partitioning 
the join column domain of each relation 400 forming part of 
the query into P ranges p of equal size, as described above 
(st^ 522), and dien further dividing each such range p into 

35 M subrai^es 404, also as described above (step 524). In 
general, however, any of a ntimba of mechanisms may be 
used to partition the original query into sets of independent 
tasks of decreasing estimated task time for concurrent 
execution. A particular mechanism Is described further 

^0 below. 

Referring now to FIG. 8, in the transfer phase 530. die 
portions of die tasks T^ resident on a given duster 102 
(Ic.. die actual tuples forming the corresponding partitions 
of the relations to be joined) are transferred to the clusters 
65 diat will actually be processing tiiem, or to die universal 
cluster (SIM) 108 (FIG. 1), depending on the relative task 
size. In general, as described more fully 10 below, "smaller^ 
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tasks are transfeired to the universal duster 108. while 4ata maoagement instances such as a relational database 

"larger** tasks are transfcired to aparticular processor cluster which store data, such as relatloaai tables and pox^vide a 

102. language such as SQL - see D. Chamberlin et al.. *^EQUEL 

Each proce5S<H^ 104 of the system 100 is allotted an equal 2: A Uniiied Approach to Data Definition. Manipulation, and 

portion 1/P of the memory capacity of universal is cluster s Contror, IBhf Journal of Research and Development^vol. 

108. In the initial portion of the transfer phase, for each 20. no. 6. pp. 560-575 (November 1976), for a Iwthcr 

processor p (104) of the system 100. the tasks coat- discussion of SQL — for issuing requests to retrieve such 

sponding to that processor and residing on a particular data. For the purposes of explanation, in the description 

cluster 102 are uransferred from that cluster to the universal below the SQL language will be used for examples and 

cluster 108. beginning with the taskT^^ having the smallest |o relational database tcnninolpgy will be used; however, the 

estimated completion time and progressing in order of procedure is not Limited to the relational database sc<^. The 

increasing task size (ie.. decreasing m), until the allotted SQL request is denoted as R. and the relational database 

portion 1/P is filled (step 532). The remaining tasks for instances as DBp DB^. and so on. The request R may 

each processor p (104) are transferred to the cluster 102 involve data stored at any subset or all of DB^ . . . DB/^. for 

owning the processor, unless they are already resident there ^5 any number N. Each database provides, upon request, cata- 

(step 534). log information about the data that it stores. All tables are 

Referring to FIG. 9. in the join phase 540, each processor accessible from any of the DB, instances, either through 

p (104) within a particular cluster is initially assigned task multiple processing unit access to the same data - one 

T,^ (step 542). The processors 104 perform the tasks exan^lc is shared data mechanisms such as described in C. 

assigned to them in a conventional manner that 30 does not 20 '^Parallelism in Relational Data Base Systems: 

form part of die present invention. Subsequent tasks Architectural Issues and Design ApfHXiachcs". IEEE Pro- 

where m>l. arc assigned dynamically as die initially ceedings of Distributed and Parallel systems conference, 

assigned tasks are coii4>leted. Thus, upon the completion of Dublin. Ireland (July 1990) — or through refloated cc^ies as 

a task (step 544), a determination is made of whether there described in Proceedings of the Second International Con- 

are any remaining tasks in that cluster 102 that have not been 25 A"^* Parallel and Distributed Information Systems, 

assigned (step 546). If so, then an available processor 104 in San Diego. IEEE Computer Society Press (January 1993) 

the cluster 102 (eg., the processor that was executing the (hereinafta Proceedings 199 J^. 

just-completed task) is assigned the task T^ from the The join request R for data stored in multiple tables is 

cluster with the longest estimated completion time (step assumed to have the following form or to be transformable 

548). In these subsequent assignments, the task T,„^ 30 through various means to such a form: 
assigned a processor 104 need not have the same index p as 

the processor; the task need only be one of the tasks that was 

transferred to the cluster 102 in the transfer phase. The task ^SnS^f^^* 

assignments are in this sense dynamic, since only the clusts ^^^^ 

102. not the processor 104. is predetermined (in the transfer 33 
idiase). 

If there are noremaining tasks in the cluster 102 that have where (expressions) arc any computation s"PP«t«»^ 

not been assigned, a deter^dnation is made of whether there '*?."f .^^e and (rest of query) is an optiooal phnse. 

areanyremainingtasksinlheunivffsalclustal08thathave whid>. ifpresent descnbcs any quaMcahons alx«t Ihe dau 

not bLn assigned (step 550). If so, then u. available « ^Z.'^TT^s'^T'^ l^J^''^ ^' 

processor 104 in the clusta 102 is assigned the task K""***- "^^M (list of tables) provides m the hst of tables the 

from the universal cluster 108 with thelongest estimateS ^ * J*"!* 

completion time (step 552). msc'^lw are denoted as T,. T, T, .. . 

. / » 1 . .1. *u *rc expected to be among those stored by DBi, . . . DB/,. 

If there are no remaining tasks in either the processor ^ »»^k ^^.i.^ jc L» 

duster 102 or the universal cluster 108. then sdi^g is 4S J°»» « fo\S^. 



complete for that query, and the only remaining item in this 

phase is the actual completion of task processing by the select name, salary, iwino 

various processors 104. together with any required post- from EMF1X)YEE, deft 

processing steps such as merging the results of the individual SSsR ST^™"^^ " DEPTi»4GRENO 

tasks and rqxirting die query results to the user. 50 



The invention has been described with particular refer- 
ence to a sort merge join. However, as noted above the same Referring to FIG. 12. upon receipt of the original data 
principles can also be qjplied to other query types such as request R, the QP performs the following steps; 
a hash join. Thus, in a hash join, each relation might be l.ThcQPexaminesthcFROM(listof tables) contents of the 
correspondingly hashed into P partitions, each of which is in 55 request and estimates the cost of providing the answer to 
turn furlfaci divided into M subpartitions of decreasing size query (Stq) 1202). Such an estimate could be pro- 
to create MP independent tasks whic^ are transferred to the vided from many mechanisms that are not part of this 
various clustas 102 and 108. invention. One ahcmative is to execute the quay and 
... . meastire elapsed time, or CPU time* or I/O time, or any 
Particular Partitionmg Mcdianism ^ combination. Another alternative is to measure or csti- 

As noted above, any of a numbcx of mechanisms may be mate any quantity of in^Ksitance to the system designer or 

used to partition an original query R into sets of independent the requester, such as response time; such a measurement 

tasks of decreasing estimated task time for cotkcunent coulddi£ferby type of data oa* requester or time of day. etc. 

execution. A particular mechanism in the form of a query Still anodier alternative would be to analyze the request 

processor (QP) is described below. 6S using known mechanisms such as those in P. Selingcr et 

The problem domain assumes a requester which issues al.. "Access Path Selection in a Relational Database 

requests far data, in a language such as SQL and one or more Management System**. Proceedings of ACM SIGMOD 
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Cor^ermcc. pp. 23-34 (June 1979), or P. G. Selingcr ct 
aL, "Access Path Selection in Distributed Database Man- 
agemeot Systems". Proceedings International Cortfer- 
ence on Data Bases, Dccn and Hammersly (eds.). Uni- 
versity of Aberdeen, pp. 204-215 (July 1980), also 
avaUable as IBM Research Report RJ28S3 (August 1980), 
resulting in ao estimated total cost in tcnm of an arith- 
metic combination of CPU and I/O and possibly message 
costs. 

2. The QP determines the contribution of each table in the lo 
FROM list of tables to the ovaall cost obtained in sXcp 1, 
using any one of a variety of techniques (step 1204). One 
such tedinique is to use database tools, such as the 
EXPLAIN mechanism of the IBM DATABASE 2™ 
(DB®) 3.1 relational database manager, desaibcd in the is 
IBM manual DATABASE 2 Version 3.1 General Informa- 
tion (GC26-4886-00), which iMX)Yides a description of the 
execution of the request plus the costs associated with 
each table. (DATABASE 2 is a trademark, and IBM and 
DB2 arc registered trademarks, of IBM Corporation.) An 20 
alternative technique would be to construct in the QP a 
cost analyzer that formulates its own estimates of cost and 

to calculate and record the contribution of each table T, to 
the ovaall cost 

3. The QP constructs a list Lj, Lj^ of the tables in 25 

the request in order of their cost coatributioD to the overall 
cost of the request, with the roost costly as (step 1206). 

4. Next die QP identifies the tables whidi pwrnit partitioned 
access (step 1208). This determinatioD can be done with 
various mechanisms such as using the catalogs of DB^, . 
. . DB/^, or any extract of those catalogs whidi QP itself 
may keep. In some implciDentations, all tables may be 
partitioned, while in others only certain tables may be 
partitioned. By partitioned access is meant being able to 
access only a portion of the entire table using database 35 
mechaziisms. Such mechanisms include indexes or scans 
on subsets of the table, such as die partitions of DB203.1 
described in the above-identified manual the ROWID 
predicates of the Oracle relatioDal database manager as 
described in Proceedings 1993. cited above. 

5. The QP chooses as the table to be split die first table in the 

list L| , L2 pormits partitioned access (step 

1210). This table is caUed T^ 

6. Each such T^ will have a means of expressing the 



denoted as P^ i=l . . . NL. Using as an example the query 
given in the background section, die EMPLOYEE table 
might have partitioned access using the DEFTNO 
attribute, with every 10 values in a different partition 
scope. In this example, then, the partition ed acces s 
request P. would be EMPLOYEE^DEPrNO BETWEEN 
1 AND 10, and EMPLOYEE-DEPTNO BETWEEN 11 
and 20, etc. 

9. QP then makes M^ cojnes of the original request R (stop 
1216) and attaches to die id» copy a WHERE clause wiUj 
the conjunction of die original predicates if any plus die 
Phd jj, associated with the idi partition scope of T^ (step 
1218). The result of diis attachment is denoted die modi- 
fied request R^. 

Using as an example the query given in the background 
section, the resulting R^ request will have the form: 



SEUSCT NAME, SALARY; DEFTNO 

PROM EMFLOYEE, DEFT 

WHERE EMFLOYEEENO = DOT.MORENO 

AND EMPLOYEBJOTTNO BETWEEN 1 AND 10 

ORDER BY 3 



30 



Requests Rj ttirough will have a similar form, wiUi 
different values for die scope of the partition (e.g., 11 
dirough 20). 

The above steps 1-9 are performed in the partition phase 
520 (FIGS. 5 and 7) of query processing. 

10. Each of die Mj modified requests is sent to a processing 
unit 104 for execution using mechanisms that do not form 
part of this invention. This step is performed in the 
transfer lAase 530 (FIGS. 5 and 8) of query processing. 

11. Following die join phase 540 (FIGS. 5 and 9). as each of 
the M^ modified requests returns its result to the QP, the 
QP peifOTms further activities to meige the results back 
into a form expected by the requester (step 1222). Such 
further activities could Involve soiting. aggregation, and 
other processing, for cMmple. 

Further modifications can be made to the above procedure 
40 to adjust for the choice of access path or overall execution 
plan for accessing the requested data. For example, an 
examination after step 9 of the resulting plans for each of the 
M^ requests may indicate that an unfavmble access path 
catii >uwii I J wui uaY%^ a ui.^^ ^Jx ^.w^woa^ was sdcctcd Of au analysis of the overall costs of each R; 

partitioned access request Aprcfctred means is to capture 45 may indicate diat Tj was not a suitable candidate table for 
this partitioned access inquest by use cf a predicate or splitting; for cxaiiq>lc, a heuristic that die sum of die overall 
predicates, denoted by P^ that can be attached to die costs of Rj,!^, ... R^^, is more ttian 2 times die overall cost 
original request in order to narrow tiiat request to a single of die original request R could be used. In dicse cases. T^ 
partition of T^ may be rejected as die table to use for partitioning, and list 

7. The QP determines, using mechanisms that do not form 50 L may be modified to include only those L/s that follow the 
part of tiiis invention, how many partitions Tj should be L, that represented Ty. Then die process of selecting anodicr 
split into, and how to describe the partition scope (step table for splitting is begun, starting at step 3 above. 
1212). Sudi mechanisms might inchide a predetermined Unlike other possible mechanisms, the procedure 
number from die QP or database instance catalogs, die described above e^lcrits cost-based knowledge about die 
number of working processing units, die number of daU- 55 request Furthermore, the procedure allows for the possibil- 



base instances available, or a computation based on 
current system load across all of die [Ht)ces5ing units or 
database instances. Once selected, the number of parti- 
tions for Tj is denoted as My. (My corresponds to MP in the 
discussion further above.) « 
8. Given Tj and My. the QP determines the scope of each 
partition, using mechanisms diat do not form part of this 
invention (step 1214). One such mechanism might be a 
list of value ranges for a partitioning index on table Ty in 
the database or QP catalog; another might be ranges of 65 
row IDs defined by an equal number of rows per My 
partitions. The My instances of partition scope will be 



ity of using the database itself in calculating the costs of die 
aitcmative tables in the original request, which will gener- 
ally provide a better choice of table for splitting than such 
other mechanisms which are independent of the database 
engine. 

The procedure is c^>able of using a variety of partitioning 
access mechanisms, such as indexes or storage partitions, 
not singly row IDs. Because the procedure allows value- 
based partition ranges, which may include values in multiple 
attributes, it is more flexible, allowing nxire oj^xvtunity for 
optimization and therefore potentially better performance. 
Because the procedure e]q)re35es the split requests in terms 
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of possibly con^lex {H^cdicates that arc mcffc nonprooedural 
than TOW IDs and less dependent on the physical storage of 
data by the underlying database instances, the procedure will 
allow for further transformations and processing, including 
but not limited to further paititi(»iing and parallelizing 5 
and/or distributing die processing of each modified 
request instance. 

Fuithermore. using value-based predicates has an inherent 
advantage over physical addressing schemes such as those 
that split requests based on row IDs. This is because the use lo 
of value-based predicates can be combined with language 
processors that peiform transitive closure on predicates to 
result in splitting requests on possibly more than one table 
at a time without an exponential number of resulting 
requests. For exan^le, in a join request that links 4 tables on is 
the same attribute (e.g., tlJi=t2.x and t2.*=t3ji and t3Jt= 
t4 jc) the work on all four tables will be split if the request 
is split on the x attribute of any of the tables. 



Conclusion 



20 
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Various modifications may be noade to the system 
described above, as described^ for example, in the concur- 
rently filed application entitled *Task Scheduler for a Mul- 
tiprocessor System,- now U.S. Pat No. 5»437,032. Thus, as 
already noted, each of the elements denominated as a 
*'processor" may be a tightly coupled processor complex 
rather than a uniprocessor. Further, each processor 
(uniprocessor or processor conq)lex) may support a plurality 
of concurrently executing tasks rather than only a single task 
as described above. In such a case, each processor would be 
initially assigned a plurality of tasks, in accordance with &e 
desired mult^rogramming leveL rather than only a single 
task as described above, and would be assigned new tasks as 
necessary to maintain such desired mult^ogramming level. 

In addition, in assigning awaiting tasks, other factors such 
as the '^affinity"* of a task for a particular processor may be 
taken into account. Also, In a system In which mulc^le 
queries are being processed concurrently^ tasks from differ- 
ent queries might be prioritized and executed concuiTently in ^ 
accordance with a desired scheme for ensuring 'fairness** 
and avoiding undue starvation of low-priority tasks. Still 
other modifications will be apparent to &o$e skilled in the 
art 

What is claimed is: 

1. In a system having a plurality of processors, a plurality 
<^ partially shared storage facilities, and a universal storage 
facility, each of said partially shared storage fadUties toeing 
shared by a particular subset of said plurality of processors, 
each of said processors sharing at least one of said partially ^ 
shared storage facilities, said universal storage facility being 
shared by all of said processors, a method of performing a 
parallel operation on a data set comprising the stq>s of: 

(a) partitioning said operation into a plurality of indepen- 
dent tasks having different estimated completion times, S5 
each of said tasks being restricted to a corresponding 
subset of said data set; 

(b) assigning a predetermined subset of said plurality of 
tasks having smaller estimated coiiq>letion times to said 
universal storage facility; 6o 

(c) assigning the remaining tasks of said plurality of tasks 
to said partially shared storage facilities, said remaining 
tasks having larger estimated completion times; 



(d) initially assigning each of said processors a task from 
a partially shared storage facility shared by said pro- 
cessor for performance of said operation thereon; 

(c) upon the completion of a task by one of said 
processors, assigning that processor an awaiting task 
from a partially shared storage facility shared by said 
processor if any awaiting tasks remain in said partially 
shared storage facility, otfaervise, assigning said pro- 
cessor an awaiting task from the universal storage 
facility; and 

(f) rq^cating the preceding step (e) until each of said tasks 
has been assigned to one of said fcoccssors. 

2. The metfiod of claim 1 wherein upon the completion of 
a task by one of said processors, that processor is assigned 
an awaiting task from a partially shared storage facility 
shared by said processor or from the universal storage 
facility having the longest estimated con9>lctioD time. 

3. The method of claim 1 wherein each of said partially 
shared storage facilities is shared by a plurality of proces- 
sors, 

4. In a system having a plurality of processor clusters* 
each of said processor clusters containing one or more 
processors and a duster storage facility shared by the 
processors of that cluster, and a universal duster comprising 
a shared storage facility shared by all of said processors, a 
method of perfcvming a paialld operation on a data set 
comprising the steps of: 

(a) paititioaing said operation into a plurality of indepen- 
dent tasks, each of whidi is restricted to a correspond- 
ing subset of said data set; 

(b) assigning a predetermined subset of said plurality of 
independent tasks to said universal duster; 

(c) assigning the remaining tasks of said plurality of 
independent tasks to said processor clusters; 

(d) initially assigning each of said processors a task from 
the correspoading processor duster for performance of 
said conation thereon; 

(e) upon the completion of a task by one of said 
processors, assigning that processor an awaiting task 
from the corresponding processor clusto* if any await- 
ing tasks remain in said processcx* duster, otherwise, 
assigning the processor an awaiting task from the 
universal clusta; and 

(f) rq>eating the preceding stq> (f) until each of said tasks 
has been assigned to one of said processors. 

5. The method of claim 4 wherein each of said processco* 
clusters contains a plurality of processors, 

6. The nKthod of daim 4 wherein said st^s of assigning 
tasks to said clusters comprise the step of transferring the 
c<HTesponding subsets of said data set to said dusters. 

7. The method of dalm 4 wherein tasks having smaller 
estimated conoplction times arc assigned to said universal 
cluster while tasks having larger estimated completion times 
arc assigned to said processor clusters. 

8. The method of daim 4 wherein said processors arc 
assigned awaiting usks havbg the longest estimated 
conopletion times. 
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